SUMMARY



The Problem 

The Denver-Stanford project is involved with teaching Spanish to fifth
and sixth grade pupils in the Denver Public Schools, and one of its concerns
has been the evaluation of these pupils' abilities to speak Spanish. When
the project began in 1960, no tests of Spanish speaking ability at the
elementary school level were available. Project personnel therefore began
development of speaking tests.



Results 



Careful review of relevant literature led to the conclusion that
speaking skill could be broken down into three distinct aspects: the ability
to pronounce Spanish sounds properly; the ability to compose Spanish
sentences orally, using correct syntax and grammar; and the ability to
communicate in Spanish with ease and naturalness. To measure these separate
aspects, speaking tests, composed of phonetic accuracy, structure, and
fluency sections, were constructed.

The tests were administered by project personnel to pupils selected
randomly. Each pupil's performance was recorded on magnetic tape, and each
was in turn evaluated independently by at least two persons.

Both composite and rater reliabilities were computed in statistical
evaluation of the tests. Since each test part measured a separate aspect of
the speaking skill, internal validity varied inversely with composite
reliability. Consequently, a low alpha coefficient, the measure of composite
reliability, was sought. Rater reliability, on the other hand, reflected the
extent to which similar scores were assigned each pupil by the separate
evaluators, and, therefore, a high figure was sought.

The development process revealed several points to be considered in
constructing a foreign language speaking test. If the test parts are
to reflect different aspects of the speaking skill, they must be evaluated
separately, and the evaluator must be careful not to be influenced by
performance on one section when scoring another. A two- or three-point
scoring scale, with each scale position defined by a specific behavioral
element, seems desirable. Finally, each test part should produce about the
same mean score and about the same variance to weigh equally in the total
test score.

The development was completed during the 1960-61 school year, and the
tests have been used in subsequent years and found to be satisfactory. 

The Denver Public Schools and Stanford University's
Institute for Communication Research are currently engaged
in a joint research project on the context of instructional
television. The purpose of the project is to learn how
instructional television can best fit into the total teaching
situation. A substantial amount of research has established
that television is a very effective teaching medium.
Ways of combining it with other educational activities must 
now be considered, and the Denver-Stanford project is a 
beginning effort in this direction. Kenneth E. Oberholtzer 
is principal investigator for the Denver Public Schools 
and Wilbur Schramm is principal investigator for Stanford 
University. This is one of a number of project progress 
reports. 



The Problem 

The primary purpose of the Denver-Stanford project is to explore the 
context of instructional television and to improve the effectiveness of 
instruction by changes in context. Elementary school Spanish was chosen 
as the subject matter to be used throughout the project. Therefore, though 
the teaching of Spanish per se is secondary to project aims, it is essential 
to the welfare of pupil participants that the best teaching methodology in
this field be utilized. 

In line with the latest findings on language instruction and the
recommendations of those associated with the Foreign Language in the
Elementary Schools (FLES) program, the audio-lingual approach has been used
exclusively during the first year of instruction (fifth grade), and it plays
a major role during the second year (sixth grade), although reading and
writing are introduced then.

The first skill which pupils must acquire in this approach is listening
comprehension, the ability to understand what is said in the second language.
The second skill is the ability to speak in the second language and to carry
on meaningful communication. A facility in both listening and speaking must
be acquired before the child begins to read and write (Brooks, 1960,
pp. 119-132).

Measurement is necessary, of course, both to evaluate experimental
procedures and to determine if the general aims of language instruction are
being satisfied. Five listening comprehension tests for administration via
television have been developed by project personnel, and this development,
which was relatively straightforward, is described in a previous progress
report (Andrade, Hayman, and Johnson, 1961).

Considerable effort has also gone into the development of speaking tests.
This type of measurement is much more complex, however, for, as has been
observed elsewhere, "speaking ability presents the most difficult problem in
(foreign language) testing" (State Department, 1958, p. 14). Huebener
suggests the basis of some of these problems as follows:

     Speaking ability is the most difficult phase of a foreign
     language to teach and to acquire.

     This ability is least likely to be retained, for it depends
     on constant practice.

     It is difficult to teach because it requires unusual
     resourcefulness, skill, and energy on the part of the teacher.

     Teaching ability cannot be acquired through a textbook (Huebener,
     1959, p. 8).

Huebener makes it clear that considerable experience and training are
necessary before a teacher can adequately teach pupils to speak in a second
language. And certainly a teacher must be well qualified before he can
validly judge the speaking performance of others. Even if the teacher is
sufficiently skilled to evaluate speaking performances, however, test
administration is difficult. Keesee points out that "this skill
(speaking ability) is measured only through providing an opportunity for
the pupil to speak" (Keesee, 1960, p. 60). Handling pupils individually is
at best a time consuming, exhausting process, and it requires painstaking
care to assure similar test conditions for every subject.

The Denver-Stanford project currently has over 13,000 fifth and sixth
grade pupils participating, with more than 350 teachers handling classroom
activities. As in other localities, only a small per cent of teachers have 
the training and experience to qualify them as experts in Spanish. This means, 
therefore, that only a few of those in the project could validly and reliably 
handle the measurement of speaking skills. In light of the necessity for 
such measurement, this situation — combined with the difficulties inherent 
in assessing the ability to speak — has presented a real challenge to 
project personnel. This report describes the attempt to meet this challenge 
in the development of oral measuring instruments, and it discusses the use 
of these instruments in the project. 



Development Criteria 

Language Skills. According to MacRae, an audio-lingual language program
at the elementary school level is built on the following learning experiences:
"Hearing the new language in meaningful patterns, imitating the new sounds by
rote, speaking the new language in meaningful situations, and recombining
vocabulary thus acquired in class-originated oral experiences" (MacRae, 1957,
p. 24). And, as MacRae says further, "The skills that boys and girls in the
elementary grades may be expected to develop are closely related to the
learning experiences we have just noted. . ." (Ibid., p. 25).

For testing purposes, these skills must, of course, be defined in terms
of specific behavioral elements, and, again according to MacRae, they can
be defined as the pupil's ability:

     To speak Spanish with ease and naturalness and an acceptable
     unanglicized accent.

     To have developed the ability to listen carefully enough to
     retain and repeat new sounds.

     To become aware of the mechanics of speaking.

     To realize something of language structure, not grammar as
     such, but that words have different functions to perform as they
     are fitted together to express meaning.

     To have acquired by ear one of the most important characteristics
     of Spanish structure, the agreement of nouns and adjectives
     (Ibid.).

It seemed to project personnel that these abilities could be measured by
a test composed of three distinct sections: phonetic accuracy, structure, and
fluency. The phonetic accuracy section would test the pupil's ability to
pronounce Spanish words properly and to repeat sounds, the structure section
would test his ability to use correct syntax and grammar in orally composing
Spanish sentences, and the fluency section would test his ability to
communicate in Spanish with naturalness and ease.

Conditions of Administration. Administration of a test of language
speaking skills presents special problems per se. Each subject must be allowed
an opportunity to perform, but this performance cannot be in the classroom 
since administering the test in the classroom would favor those pupils who 
heard the items several times before their turn to be tested. 

Furthermore, in at least the development and early validating stages of 
a subjectively scored measuring instrument, each pupil's performance should 
be evaluated by two or more persons working independently. The reasons for this 
will be apparent in the next section. The point here is that, though at least 
two evaluations are needed, having two or more evaluators present at the test 
administration would be undesirable because it would involve inefficient use 
of a considerable amount of project personnel time and because it would make 
independent evaluation difficult.

The solution to these problems seemed to lie in testing pupils individually
in a room with only the tester and pupil present. Recording the performance
on magnetic tape would allow independent evaluation at a later date. And,
finally, the individual testing situation would allow the tester greater
control and would assure, within reasonable limits, similar testing conditions
for all subjects.

Validity and Reliability. Validity was a difficult problem because of
the lack of an outside criterion against which to compare obtained results.
In the first place, project personnel were unable to locate a speaking test
designed for elementary school Spanish. In the second, even if one were
available, its adequacy, in terms of specific needs of the Denver Public
Schools' Spanish program, would be questionable.

The test should be comprehensive, that is, it should be a representative
sample of the course content. And, as Keesee has noted, "the pupils (should
be) tested as nearly as possible in the manner in which they have been taught.
. . . No complicated unfamiliar visual materials should be introduced in a
test" (Keesee, 1960, p. 61).

The only alternative in this situation is to use construct validity, in
which the test objectives are ". . . made so explicit that one can determine
(without empirical demonstration) whether each answer to a test item is a
behavior belonging in the class (of behaviors) in question" (English and
English, 1958, p. 575). The test items were chosen jointly by several persons
who were thoroughly familiar with course content and objectives. In making
choices, this group kept in mind the need for comprehensiveness of the test
as a whole and for preciseness in definition of individual behaviors sought.

The need for content validity and generally understood principles of
testing necessarily restrict test content to course content. A test should
be a representative sample of course content, and as such it will be
comprehensive. A test must not go beyond course content, however. The
behavioral elements chosen for evaluation, then, were elements which had
been taught.

Reliability had to be approached in a manner different from that normally
employed. The split-half method could not be used because it requires that a
test, by some means, be divided into parallel parts, and, according to
Guilford, "To be parallel parts, . . . the subtests that compose the parts
should have items of equal average difficulty, equal spread of difficulty,
and equal item intercorrelation, and the same amount of time should be
devoted to each" (Guilford, 1954, p. 377). These conditions would obviously
be most difficult to satisfy in the proposed speaking test.

One appropriate method of estimating reliability under these conditions
is to use Cronbach's generalized equation, which produces what Cronbach has
named the "coefficient alpha." The formula for coefficient alpha is:

$$\alpha = \left(\frac{n}{n-1}\right)\left(1 - \frac{\sum V_i}{V_t}\right)$$

where V_i = variance of part i of a test, the size not specified
      V_t = variance of total scores
      n = number of parts.



The alpha coefficient in this case will give a composite reliability,
which reflects, among other things, the dispersions of the separate components
of the test and the component intercorrelations. As Guilford states, "High
intercorrelations of components detract from validity of the composite. Where
validity is at stake for a composite, we would therefore not strive toward
high composite reliability but the reverse" (Ibid., p. 393).
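
As a rough illustration of the computation, the following Python sketch evaluates coefficient alpha as n/(n - 1) times one minus the ratio of summed part variances to total-score variance. The function name, the data layout, and the use of population variances are assumptions made for illustration; the report does not describe how the computation was actually carried out.

```python
from statistics import pvariance


def coefficient_alpha(part_scores):
    """Coefficient alpha for a test made of several parts.

    part_scores: list of lists, one inner list per test part, each holding
    the scores of the same pupils in the same order (hypothetical layout).
    """
    n = len(part_scores)
    if n < 2:
        raise ValueError("coefficient alpha needs at least two parts")
    # Total score for each pupil is the sum of that pupil's part scores.
    totals = [sum(scores) for scores in zip(*part_scores)]
    sum_part_variances = sum(pvariance(part) for part in part_scores)
    total_variance = pvariance(totals)
    return (n / (n - 1)) * (1 - sum_part_variances / total_variance)
```

Parts that rise and fall together push the value toward 1, while uncorrelated parts push it toward 0, which is the direction sought here for a composite of distinct skills.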

Another problem in reliability existed because of the subjectivity in
evaluating results. In this case, the scorer as well as the test content
contributes errors of measurement. According to Guilford, the preferred method
of estimating rater reliability is to correlate scores assigned by different
persons working individually (Ibid., p. 395). One approach to this problem is
offered through intraclass correlation, for which Ebel has given the following
formula:

$$r_{11} = \frac{V_p - V_e}{V_p + (k - 1)V_e}$$

where r_11 = the mean reliability of ratings for one rater
      V_p = variance for persons
      V_e = variance for error
      k = number of raters.

The reliability of the mean of k ratings for each person would be:

$$r_{kk} = \frac{V_p - V_e}{V_p}$$

The computation formulae for the appropriate variances are given on
pages 396 and 397 of Guilford's Psychometric Methods (Ibid.).
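
The two intraclass coefficients can be sketched in Python as follows. This is only an illustration under stated assumptions: the variances for persons and for error are taken as the mean squares from a pupils-by-raters analysis of variance, with differences between rater means removed from the error term, and the function name and data layout are hypothetical rather than taken from the report.

```python
def rater_reliability(ratings):
    """Return (r_11, r_kk) for a pupils-by-raters table of scores.

    ratings: list of rows, one row per pupil, each row holding the scores
    assigned to that pupil by the k raters (hypothetical layout).
    """
    n = len(ratings)
    k = len(ratings[0])
    if n < 2 or k < 2:
        raise ValueError("need at least two pupils and two raters")
    grand = sum(sum(row) for row in ratings) / (n * k)
    person_means = [sum(row) / k for row in ratings]
    rater_means = [sum(row[j] for row in ratings) / n for j in range(k)]
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_persons = k * sum((m - grand) ** 2 for m in person_means)
    ss_raters = n * sum((m - grand) ** 2 for m in rater_means)
    # Assumption: error is what remains after persons and raters are removed.
    ss_error = ss_total - ss_persons - ss_raters
    v_p = ss_persons / (n - 1)
    v_e = ss_error / ((n - 1) * (k - 1))
    r_11 = (v_p - v_e) / (v_p + (k - 1) * v_e)
    r_kk = (v_p - v_e) / v_p
    return r_11, r_kk
```

Under these definitions the two coefficients are tied together by the Spearman-Brown relation, r_kk = k r_11 / (1 + (k - 1) r_11), so either one can be recovered from the other once k is known.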

Another approach to the problem would be to compute Pearson product moment
correlation coefficients (r_ab), though this method would have the
disadvantage of producing a separate figure for each pair of raters.

With all of the rater reliability estimates, the object is to produce
coefficients as high as possible, that is, to produce maximum agreement among
raters.

Reliability estimation for the type of test being developed, then, was
approached in two distinct ways, each involving a different objective. For
composite reliability, the objective was to minimize the reliability rating.
For rater reliability, on the other hand, the objective was to maximize the
rating.

The First Trial Test

Make-up of the Test. A speaking test, with sections indicated in the
previous section and based on trials with a few children, was constructed in
the fall semester of the 1960-61 school year. The test was administered at
the end of the semester to a random sample of 130 fifth grade children who
were taking first year Spanish and were therefore in the research project.

In the phonetic accuracy part of the test, the tester spoke the Spanish
sentence, "El hijo pequeño tiene un libro amarillo" (The small boy has a
yellow book), and the subject then repeated the sentence. This sentence was
designed to allow the pronunciation of all of the Spanish vowels and the
consonants "ñ" and "ll" to be evaluated.

In the structure section, the tester asked the following five questions, 
and the subject was asked to respond in complete Spanish sentences. 

     1. ¿Cómo se llama usted? (What is your name?)

     2. ¿Qué artículos de ropa usa un niño? (What articles of clothing
        does a boy wear?)

     3. Dígame usted las partes del cuerpo. (Tell me the parts of the body.)

     4. ¿Qué se pone usted en los pies? (What do you put on your feet?)

     5. ¿Cuántos ojos tiene usted? (How many eyes do you have?)

This section was designed to allow evaluation of pronunciation, syntax, 
structure, extent of vocabulary, spontaneity of response, and appropriateness 
of response. 

In the fluency section, a visual, which clearly showed members of a family, 
parts of the body, and articles of clothing, was displayed. The subject was 
instructed to tell all that he could about the picture in complete Spanish 
sentences. 

Scoring the Test. The test was scored independently by two members of the
project staff. The scorers went over several sample performances together so
that their evaluation criteria would be as similar as possible. Then each went
through the total group of performances without knowing what scores the other
had assigned.

The scoring itself was accomplished with a standard rating sheet (Appendix A),
on which the pupil performance on each test item was rated from excellent
to poor on a five-point scale. The scorer would first listen to the complete
performance and then listen to and evaluate the separate sections. If the
scorer was uncertain as to the exact scale position to be marked for a
particular item, he would consider the child's performance on the skill
measured by this item in other parts of the test.

Test Reliability. The composite reliability for the test, as measured
by the alpha coefficient, was:

     α = .740

In light of the desire for validity and therefore low composite reliability,
this alpha coefficient seemed too high. It would indicate, among other things,
high intercorrelations between test parts. These intercorrelations plus the
correlations of each test part and the total test with the first semester
listening comprehension test (the measure of the understanding skill) are
shown in table 1. Pearson product moment correlations are given in this table.



Table 1

CORRELATIONS BETWEEN SPEAKING TEST PARTS, TOTAL SPEAKING
TEST, AND LISTENING COMPREHENSION TEST, FIRST SEMESTER 1960-61

                                                        Listening
Test Part            Structure   Fluency   Total Test   Comprehension Test

Phonetic Accuracy      .502       .487     [illegible]     [illegible]
Structure                         .368       .908            .662
Fluency                                      .812            .740
Total Test                                                   .714




Table 1 shows rather high intercorrelations between test parts, and it
indicates that the structure section was doing about the same thing as the
test as a whole. The correlations of test parts with the total are spurious,
of course, because each part contributes to the total. To determine the
correlation between each part and the total, with the influence of that part
removed, part-whole correlations were computed (McNemar, p. 164). The
part-whole correlation of phonetic accuracy with the total was .53[?], of
structure with the total was .653, and of fluency with the total was .68[?].
Again the figures indicate high relationships, probably higher than would be
expected if the test parts were really measuring different aspects of the
speaking skill as desired. In this respect, a need for improvement was
definitely indicated.
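
One way to picture a part-whole correlation "with the influence of that part removed" is to correlate each part with the total minus that part. The Python sketch below takes that reading as an assumption; the report cites McNemar without giving the formula, and the function names here are hypothetical.

```python
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson product moment correlation of two equal-length score lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def part_whole(part, total):
    """Correlate a part with the total after subtracting the part itself."""
    remainder = [t - p for p, t in zip(part, total)]
    return pearson(part, remainder)
```

Because the part no longer appears on both sides, this figure is free of the spurious overlap that inflates the ordinary part-total correlation.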

Rater reliabilities for the test were surprisingly high. Table 2 shows 
rater reliabilities for each part of the test and the total test in terms of the 
three coefficients discussed earlier. As mentioned previously, two raters were 
used. Rater reliability, then, seemed to be satisfactory. 

In table 2, r_11 and r_ab are the same in each comparison. This suggests
that the product moment correlation, r_ab, is a special case of r_11 where only
two raters are involved. As proved in Appendix C, this is only true if the
variance of scores assigned by both raters is the same. Test part means and
variances for each rater are shown in table 3. Though the means differ
somewhat, variances are indeed quite similar, and this explains the similarity
between r_11 and r_ab.

Table 2

RATER RELIABILITY ON THE
1960-61 FIRST SEMESTER SPEAKING TEST

                        Reliability Coefficient

Test Part              r_kk       r_11       r_ab

Phonetic Accuracy      .966       .935       .935
Structure              .971       .943       .943
Fluency                .861       .756       .756
Total Test             .976       .952       .952

Table 3

MEANS AND VARIANCES FOR EACH RATER
ON THE 1960-61 FIRST SEMESTER SPEAKING TEST

Test Part            Rater      Mean       Variance

Phonetic Accuracy      A       16.563       33.989
                       B       17.126       34.15[?]
Structure              A       18.650       65.675
                       B       19.116       66.335
Fluency                A       12.971       32.801
                       B        8.283       22.222
Total Test             A       48.582      234.182
                       B       44.582      238.424



Though the reliabilities were quite satisfactory overall, table 3 shows
that rater B gave somewhat higher scores on the average for phonetic accuracy
and structure than rater A, while rater A gave higher scores on the fluency
section. The raters could not be expected to give identical scores in each
section, of course, nor would the differences between them always be in the
same direction. Under ideal conditions, however, differences would be at the
chance level, which is definitely not the case in the fluency section.

The Second Trial Test 

Make-up of the Test. The test make-up was revised during the second
semester in light of statistics compiled on the first trial test and of
experiences of the evaluators in both the administration and scoring of the
first test. In addition, the new test covered course content from the
complete year rather than just the first semester.

The evaluators found in administering the phonetic accuracy section of the
first test that the sentence to be repeated was too long, and that many pupils
consequently forgot a word or two. This caused them to lose points even if
the sounds they remembered and did pronounce were quite accurate. Therefore,
a shorter sentence, "Es una señorita" (It is a young lady), was used. As
before, the tester spoke the sentence, and the subject repeated it. This
sentence allowed evaluation of all of the Spanish vowels.

In the structure section, the evaluators found in the first test that the 
rather general questions asked allowed too many possible valid responses. This 
section therefore was more rigidly structured in order to predetermine and 
limit the possible responses. The tester supplied the subject with vocabulary, 
not in syntactically correct order, needed to construct a sentence. Each word 
was established independently with visuals. A picture of a man was used to
establish the noun and article, "el padre" (the father). A picture of a boy
wearing shoes was used to establish the verb and object, "usa zapatos" (wears
shoes). And two strips of black paper were used to establish the plural of
the adjective, "negros" (black). Then the subject was shown a picture of a
man wearing black shoes and asked to construct a sentence from the established
vocabulary which would describe the visual. The correct response would be,
"El padre usa zapatos negros" (The father is wearing black shoes). The
vocabulary used had been thoroughly covered in the course. It was specifically
established here so that only the child's ability to arrange the words in
correct order would be measured.

As in the previous section, the evaluators felt that the stimuli provided
in the fluency section did not structure the possible responses sufficiently.
Therefore, instead of four visuals which the subject was asked to describe in
complete Spanish sentences, four different tasks were required. One of these
involved a visual to be described, as before. The others involved: (1) asking
the question, "¿Cómo se llama usted?" (What is your name?), for which there is
only one answer; (2) handing the child an apple and asking, "¿Qué tiene usted
en la mano?" (What do you have in your hand?); and (3) displaying a visual and
asking, "¿Quiénes pagan por los comestibles?" (Who pays for the groceries?).

The test was administered, on an individual basis, to 200 randomly
selected pupils at the end of the semester.

Scoring the Test. To preserve the integrity of each test part, that is,
to make each test part reflect a specific aspect of the speaking skill and not
be influenced by other test parts, the parts were scored separately. The
evaluator would listen to the phonetic accuracy section, for example, as many
times as he liked in making his judgment, but he would not listen to a
succeeding section until scores for the one in question were assigned. And he
would try not to be influenced in his present evaluation by the child's
performance in preceding sections, though preserving such independence of
thought in actual practice is difficult.

In addition, the evaluators felt that the five-point scale used in the
first test demanded finer discrimination in judgment than could validly be
made. Consequently, the five-point scale was abandoned and three- and four-
point scales were adopted. In another refinement, each scale position was
precisely defined, as opposed to the first test in which each scale represented
a range from very poor to very good without any specific behavioral element
indicating a certain scale position.

In the phonetic accuracy section, each vowel was rated as follows:

     2 = accurate reproduction
     1 = inaccurate reproduction
     0 = no production.

In the structure section, the scale scores were:

     2 = complete sentence, syntactically correct
     1 = incomplete sentence, syntactically correct
     0 = sentence not syntactically correct or no measurable response.

In the fluency section, the scoring was as follows:

• In answering the question, "¿Cómo se llama usted?"

     3 = "Me llamo ___" or "Yo me llamo ___"
     2 = "Se llamo ___" or name only
     1 = "Se llamo es ___," "Me llamo es ___," or any other combination
         in which the name is stated
     0 = inappropriate response or no response.

• In answering the question, "¿Qué tiene usted en la mano?"

     3 = correct response, given naturally
     2 = correct response, but with slight, unnatural hesitation
     1 = correct response, but given in a very slow, uncertain manner
     0 = incorrect response or no response.

• In answering the question, "¿Quiénes pagan por los comestibles?"

     2 = correct use of the two articles required in responding
     1 = correct use of one of the two articles required in responding
     0 = incorrect use of both articles or no response.

• In describing the visual:

     2 = correct verb and correct form
     1 = correct verb but incorrect form
     0 = incorrect verb and form or no response.

The scoring form for this test is shown in Appendix B.

Test Reliability. The composite reliability for the second semester test,
as measured by the alpha coefficient, was:

     α = .401

This appeared to be a much more satisfactory figure than the .740 obtained
for the first test. It indicated, among other things, that the separate parts
of the test were measuring the different aspects of speaking ability as
intended. Intercorrelations of test parts and correlations of test parts with
the total speaking test and with the listening comprehension test are shown in
table 4.



Table 4

CORRELATIONS BETWEEN SPEAKING TEST PARTS, TOTAL SPEAKING
TEST, AND LISTENING COMPREHENSION TEST, SECOND SEMESTER OF 1960-61

                                                        Listening
Test Part            Structure   Fluency   Total Test   Comprehension Test

Phonetic Accuracy      .180       .253       .236            .358
Structure                         .419       .562          [illegible]
Fluency                                      .951            .680
Total Test                                                   .715




Table 4, compared to table 1, shows a definite drop in intercorrelations 
of test parts. Correlations of test parts with the total test also went down, 
except for the fluency section, and, although the correlation of each part 
with the listening comprehension test was lower than before, the speaking test 
total correlated almost exactly the same (.715 vs. .714) with the listening 
comprehension test. Therefore, though this speaking test as a whole seemed to 
be measuring about the same skill as the first, each part was now doing its 
specific job more accurately. 

The high correlation of the fluency section with the total test was
disturbing, however. The fact that fluency did not correlate highly with the
other test parts suggested that its high correlation with the total score was
an artifact of the heavy weight given fluency in scoring. A pupil could get
a maximum of 10 points on the phonetic accuracy section, two points on
structure, and 16 points on fluency. The part-whole correlations support this
explanation. For phonetic accuracy, the part-whole correlation with the total
test was -.013; for structure, it was .396; and for fluency, it was .454. The
drop from .951 to .454 on fluency indicates its heavy influence on the total
score.

Rater reliabilities are given in table 5.

Table 5

RATER RELIABILITY ON THE
1960-61 SECOND SEMESTER SPEAKING TEST

                        Reliability Coefficient

Test Part              r_kk       r_11       r_ab

Phonetic Accuracy      .682       .517       .532
Structure              .893       .807       .807
Fluency                .989       .979       .980
Total Test             .984       .969       .971




Again, the rater reliabilities seemed highly satisfactory. Compared to the
first semester test (shown in table 2), the reliabilities for the phonetic
accuracy and structure sections went down a bit, while those for the fluency
section and the total test went up.

Means and variances for the two raters on each test part are given in table
6. On this test, the differences between means were at chance level, which
should be one result of the more specific definition of scale positions in
scoring. Variances differed more than on the previous test, however, and this
is reflected by the differences between r_11 and r_ab in table 5. Even on
phonetic accuracy, however, where the ratio of variances between raters was
about three to two, r_11 and r_ab differed by only .015; in most situations,
these two rater reliability measures will apparently give about the same
result if only two raters are used.
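
A brief worked derivation may help show why the two measures agree so closely. It is a sketch that assumes the error variance V_e is the residual left after differences between rater means are removed, and it introduces s_a^2 and s_b^2 for the two raters' score variances and s_ab for their covariance, symbols that do not appear in the report. With k = 2:

$$V_p = \tfrac{1}{2}\left(s_a^2 + s_b^2 + 2 s_{ab}\right), \qquad V_e = \tfrac{1}{2}\left(s_a^2 + s_b^2 - 2 s_{ab}\right)$$

$$r_{11} = \frac{V_p - V_e}{V_p + V_e} = \frac{2 s_{ab}}{s_a^2 + s_b^2}, \qquad r_{ab} = \frac{s_{ab}}{s_a s_b}$$

$$\frac{r_{11}}{r_{ab}} = \frac{2 s_a s_b}{s_a^2 + s_b^2} \le 1$$

The ratio equals 1 only when s_a = s_b. For the phonetic accuracy variances reported in table 6 (1.059 and .654), the ratio is about .972, and .532 × .972 ≈ .517, which agrees with the r_ab and r_11 values in table 5.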
Table 6

MEANS AND VARIANCES FOR EACH RATER
ON THE 1960-61 SECOND SEMESTER SPEAKING TEST

Test Part            Rater      Mean       Variance

Phonetic Accuracy      A        9.020        1.059
                       B        9.570         .654
Structure              A        1.210         .796
                       B        1.005         .77[?]
Fluency                A        3.880       12.426
                       B        3.575       11.533
Total Test             A       14.110       20.160
                       B       14.150       17.707




Table 6 raises a point about the usefulness of the phonetic accuracy 
section of the test. Since the maximum possible score was ten, the evaluators 
scored the average pupil about 93 per cent accurate on this part of the test, 
and the small variance shows that most pupils did this well. It was stated 
previously that the fluency section was unduly influencing the total score, yet 
table 6 indicates that it contributed only about a third as much as phonetic 
accuracy to the total. This apparent paradox is explainable through the high 
variance on fluency and very low variance on phonetic accuracy. Since each 
pupil was scoring about the same on phonetic accuracy, adding scores from this 
section to the total amounted to linearly transforming the total score. This 
would not affect the correlation of the total score with any other variable, 
and it would make the correlation of phonetic accuracy with the other test parts 
negligible. The part-whole correlation of phonetic accuracy with the total 
score was, in fact, -.013, which is not significantly different from zero. 
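
The paradox can be made concrete with rater A's figures from table 6. The sketch below compares each part's share of the mean total score with its own variance relative to the total-score variance. The variable names are illustrative, and because the comparison sets covariances between parts aside, it gives only a rough indication of each part's influence.

```python
# Rater A means and variances from table 6; names are illustrative.
means = {"Phonetic Accuracy": 9.020, "Structure": 1.210, "Fluency": 3.880}
variances = {"Phonetic Accuracy": 1.059, "Structure": 0.796, "Fluency": 12.426}
total_mean = 14.110
total_variance = 20.160

# Share of the mean total score supplied by each part.
mean_share = {part: m / total_mean for part, m in means.items()}

# Each part's own variance relative to the variance of the total score.
variance_ratio = {part: v / total_variance for part, v in variances.items()}

# Var(total) = sum of part variances + 2 * sum of covariances between parts,
# so the remainder below is twice the summed covariances.
covariance_remainder = total_variance - sum(variances.values())

for part in means:
    print(f"{part:18s} share of mean = {mean_share[part]:.1%}   "
          f"variance / total variance = {variance_ratio[part]:.1%}")
print(f"Twice the summed covariances between parts = {covariance_remainder:.3f}")
```

On these figures phonetic accuracy supplies about 64 per cent of the mean total score but has only about 5 per cent as much variance as the total, while fluency supplies roughly 27 per cent of the mean and has about 62 per cent as much variance as the total.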

Subsequent Tests 

The statistical analysis of the second semester test showed that two
further improvements were needed. The phonetic accuracy section needed revising
so that it would more accurately discriminate between pupils on their ability to
pronounce Spanish sounds. This was accomplished in subsequent tests by
eliminating the vowel "O," which is pronounced in Spanish the same way it is in
English, by adding three or four of the difficult Spanish consonants, and by
attempting to define the scale positions so that differences between excellent,
fair, and poor pronunciation could be more accurately determined.

The structure section also needed a change. It worked fine as is, but
its contribution to the total score was too small. This was overcome by using
two sentences, that is, by doubling the size of the section, and by evaluating
verb and adjective endings as well as syntax. (Vocabulary was established the
same way as before, except for the precise verb and adjective forms to be used.)
Finally, the weight of the fluency section was reduced by changing from a
three- to a two-point scale, and this, of course, increased the relative weight
of the structure section.

Sixth grade speaking tests were also developed. They employed the same
general format and scoring procedure as the fifth grade tests, but they were
built around sixth grade course content. Both fifth and sixth grade tests
have been used in the 1961-62 and 1962-63 school years. At present, only the
1961-62 results have been analyzed. The tests have seemed to work well in every
respect. Each part contributes about the same amount to the total score,
intercorrelations among parts are low, and rater reliabilities, with three
raters used, have averaged about .930. More important perhaps, different parts
of the tests have revealed significant differences between methods of teaching
elementary school Spanish.

Summary and Conclusions 

Development of speaking skills is an important part of foreign language
instruction, and adequate evaluation of a foreign language program depends in
part on the measurement of speaking skills. The Denver-Stanford project is
involved with teaching Spanish to fifth and sixth grade pupils in the Denver
Public Schools, and one of its concerns has been the evaluation of these
pupils' abilities to speak Spanish. Since no tests of speaking ability at the
elementary school level were available when the program started in 1960,
project personnel began the development of speaking tests.

Careful review of FLES recommendations and of relevant literature led to
the conclusion that the speaking skill could be broken down into three distinct
aspects: the ability to pronounce Spanish sounds properly; the ability to
structure Spanish sentences correctly; and the ability to communicate in
Spanish with ease and naturalness. To measure these separate aspects
(phonetic accuracy, structure, and fluency), speaking tests were constructed.

The tests were administered to random samples of pupils by project 
personnel. Each pupil's performance was recorded on magnetic tape, and each 
was in turn evaluated independently by at least two persons. 

Statistical evaluation in the development process was limited entirely to 
internal validity and reliability. External validity was necessarily of the 
construct type since no outside criterion, against which to compare obtained 
results, was available. Both composite and rater reliabilities were computed. 
Since each part of the test measured a separate aspect of the speaking skill, 
internal validity would vary inversely with composite reliability. Consequently,
a low alpha coefficient, the measure of composite reliability, was sought.

Rater reliability, on the other hand, reflected the extent to which similar
scores were assigned each pupil by the evaluators, and, therefore, a high
figure was sought. The three measures of rater reliability used were Ebel's
mean for k ratings, r_kk, his mean for one rater, r_11, and the Pearson product
moment correlation coefficient, r_ab.
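
As an illustration, the three coefficients might be computed from two raters' scores along the following lines. This is a minimal sketch rather than the project's actual procedure: it assumes Ebel's analysis-of-variance formulas with between-rater differences removed from the error term (the form used in Appendix C), and the function names and the scores are hypothetical.

```python
from math import sqrt


def ebel_reliability(ratings):
    """Return (r_11, r_kk) for a list of per-pupil rows, one score per rater."""
    n = len(ratings)        # N, number of pupils rated
    k = len(ratings[0])     # k, number of raters
    grand_total = sum(sum(row) for row in ratings)
    correction = grand_total ** 2 / (k * n)

    ss_total = sum(x * x for row in ratings for x in row) - correction
    ss_pupils = sum(sum(row) ** 2 for row in ratings) / k - correction
    rater_sums = [sum(row[j] for row in ratings) for j in range(k)]
    ss_raters = sum(s * s for s in rater_sums) / n - correction
    ss_error = ss_total - ss_pupils - ss_raters

    v_p = ss_pupils / (n - 1)                # variance for pupils
    v_e = ss_error / ((k - 1) * (n - 1))     # variance for error
    r_11 = (v_p - v_e) / (v_p + (k - 1) * v_e)
    r_kk = (v_p - v_e) / v_p
    return r_11, r_kk


def pearson(a, b):
    """Pearson product moment correlation, computational formula."""
    n = len(a)
    numerator = n * sum(x * y for x, y in zip(a, b)) - sum(a) * sum(b)
    denominator = (sqrt(n * sum(x * x for x in a) - sum(a) ** 2)
                   * sqrt(n * sum(y * y for y in b) - sum(b) ** 2))
    return numerator / denominator


# Hypothetical scores for six pupils from two raters (not project data).
rater_a = [9, 10, 8, 10, 9, 7]
rater_b = [10, 10, 9, 10, 9, 8]

r_11, r_kk = ebel_reliability(list(zip(rater_a, rater_b)))
r_ab = pearson(rater_a, rater_b)
print(f"r_11 = {r_11:.3f}, r_kk = {r_kk:.3f}, r_ab = {r_ab:.3f}")
```

With two raters, r_kk computed this way equals 2 r_11 / (1 + r_11), and r_11 coincides with r_ab whenever the two raters' scores have equal variance.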

The development process revealed several points to be considered in
constructing a foreign language speaking test. If the test parts are really
to reflect different aspects of the speaking skill, they must be evaluated
separately, and the evaluator must be careful not to be influenced by performance
on one section when scoring another. A five-point rating scale for specific
test items, such as pronunciation of a vowel, demanded finer discrimination
than the evaluators felt they could validly make, and a two- or three-point
scale was found more desirable. Also, the evaluators felt they could make better
judgments if each scale position was defined by a specific behavioral element.
Finally, each test part should produce about the same mean score and about the
same variance to weigh equally in the total test score.

The tests have been used in subsequent years and have been found satis- 
factory, both in terms of test criteria and in terms of differentiating 
between methods of teaching Spanish. 




Appendix A
Speaking Test Form

Student

Research Group Assignment

I. Phonetic Accuracy

   A      5  4  3  2  1
   E      5  4  3  2  1
   I      5  4  3  2  1
   O      5  4  3  2  1
   U      5  4  3  2  1
   n      5  4  3  2  1
   ll     5  4  3  2  1

   Section 1 score

II. Structure

   Sound (correct pronunciation)     5  4  3  2  1
   Order (syntax)                    5  4  3  2  1
   Form (structure)                  5  4  3  2  1
   Choice (vocabulary)               5  4  3  2  1
   Spontaneous response              5  4  3  2  1
   Appropriate answer                5  4  3  2  1

   Section 2 score

III. Fluency

   Expression of ideas               5  4  3  2  1
   Naturalness of utterances         5  4  3  2  1
   Vocabulary usage                  5  4  3  2  1
   Sentence structure                5  4  3  2  1

   Section 3 score

Total score



Appendix B

SPEAKING TEST SCORING FORM
REVISED

Student

School

Research Group Assignment

Student No.

Section A. Phonetic Accuracy

   E      2  1  0
   U      2  1  0
   A      2  1  0
   O      2  1  0
   I      2  1  0

   Section A. Score

Section B. Structure

   Order (syntax)      2  1  0

   Section B. Score

Section C. Fluency

   1. Appropriate answer                  3  2  1  0
   2. Spontaneous response                3  2  1  0
   3. Accuracy in article agreement          2  1  0
   4. Accuracy in verb usage
         (a)                                 2  1  0
         (b)                                 2  1  0
         (c)                                 2  1  0
         (d)                                 2  1  0

   Section C. Score

TOTAL SCORE

Comments

Appendix C

RELATIONSHIP OF r_11 AND r_ab
WHERE TWO RATERS ARE USED AND VARIANCE IS EQUAL

Consider the Ebel mean reliability coefficient of ratings for one rater,

$$ r_{11} = \frac{V_p - V_e}{V_p + (k-1)V_e} , $$

and let $\sum A$ = the sum of scores assigned by one rater and $\sum B$ =
the sum of scores assigned by the other rater. By definition,

$$ V_p = \frac{\sum d_p^2}{df_p}
       = \frac{\dfrac{\sum (A+B)^2}{k} - \dfrac{(\sum A + \sum B)^2}{kN}}{N - 1} $$

$$ V_e = \frac{\sum d_e^2}{df_e}
       = \frac{\sum A^2 + \sum B^2 - \dfrac{\sum (A+B)^2}{k}
               - \dfrac{(\sum A)^2 + (\sum B)^2}{N}
               + \dfrac{(\sum A + \sum B)^2}{kN}}{(k - 1)(N - 1)} $$

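where k = number of raters
      N = number being rated

With k = 2,

$$ r_{11} = \frac{V_p - V_e}{V_p + V_e}
          = \frac{\dfrac{\sum (A+B)^2 - \sum A^2 - \sum B^2
                  + \dfrac{(\sum A)^2 + (\sum B)^2}{N}
                  - \dfrac{(\sum A + \sum B)^2}{N}}{N - 1}}
                 {\dfrac{\sum A^2 + \sum B^2
                  - \dfrac{(\sum A)^2 + (\sum B)^2}{N}}{N - 1}} $$

$$ = \frac{\sum A^2 + \sum B^2 + 2\sum AB - \sum A^2 - \sum B^2
           + \dfrac{(\sum A)^2 + (\sum B)^2}{N}
           - \dfrac{(\sum A)^2 + 2\sum A \sum B + (\sum B)^2}{N}}
          {\sum A^2 + \sum B^2 - \dfrac{(\sum A)^2 + (\sum B)^2}{N}} $$

$$ = \frac{2\sum AB - \dfrac{2\sum A \sum B}{N}}
          {\left[\sum A^2 - \dfrac{(\sum A)^2}{N}\right]
           + \left[\sum B^2 - \dfrac{(\sum B)^2}{N}\right]} $$
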
If the scores assigned by each rater have equal variance, then the two
denominator terms are equal, and the equation reduces to:

$$ r_{11} = \frac{N\sum AB - \sum A \sum B}{N\sum A^2 - (\sum A)^2} $$

The Pearson product moment correlation coefficient between scores
assigned by two raters is:

$$ r_{ab} = \frac{\sum ab}{N \sigma_a \sigma_b} $$

By substitution, this reduces to the familiar computational formula:

$$ r_{ab} = \frac{N\sum AB - \sum A \sum B}
                 {\sqrt{N\sum A^2 - (\sum A)^2}\,\sqrt{N\sum B^2 - (\sum B)^2}} $$




If the scores assigned by each rater have equal variance, then the two
denominator terms in this equation are also equal, and the product moment
correlation reduces to:

$$ r_{ab} = \frac{N\sum AB - \sum A \sum B}{N\sum A^2 - (\sum A)^2} $$

This was exactly the same result obtained when the Ebel coefficient was
reduced under these conditions. Therefore, with two raters and equal variance
of scores assigned by the raters, r_11 = r_ab.